Onitilo et al. BMC Medical Informatics and Decision Making 2014, 14:38 
http://www.biomedcentral.com/1472-6947/14/38 



Medical Informatics & Decision Making 



RESEARCH ARTICLE Open Access 



A novel method for studying the temporal 
relationship between type 2 diabetes mellitus 
and cancer using the electronic medical record 

Adedayo A Onitilo, Rachel V Stankowski, Richard L Berg, Jessica M Engel, Gail M Williams and Suhail A Doi 



Abstract 

Background: We developed an algorithm for the identification of patients with type 2 diabetes and ascertainment 
of the date of diabetes onset for examination of the temporal relationship between diabetes and cancer using data 
in the electronic medical record (EMR). 

Methods: The Marshfield Clinic EMR was searched for patients who developed type 2 diabetes between January 1, 
1995 and December 31, 2009 using a combination of diagnostic codes and laboratory data. Subjects without 
diabetes were also identified and matched to subjects with diabetes by age, gender, smoking history, residence, 
and date of diabetes onset/reference date. 

Results: The final cohort consisted of 11,236 subjects with and 54,365 subjects without diabetes. Stringent 
requirements for laboratory values resulted in a decrease in the number of potential subjects by nearly 70%. Mean 
observation time in the EMR was similar for both groups with 13-14 years before and 5-7 years after the 
reference date. The two cohorts were largely similar except that BMI and frequency of healthcare encounters were 
greater in subjects with diabetes. 

Conclusion: The cohort described here will be useful for the examination of the temporal relationship between 
diabetes and cancer and is unique in that it allows for determination of the date of diabetes onset with reasonable 
accuracy. 

Background 

The National Cancer Institute estimates that approxi- 
mately 13.7 million Americans with a history of cancer 
were alive on January 1, 2012 [1] with over 1.5 million 
additional cases diagnosed each year [2]. Diabetes melli- 
tus is even more prevalent, affecting 25.8 million people, 
or 8.3% of the population, in the United States [3]. Ac- 
cordingly, it is not uncommon for the same individual 
to be diagnosed with both conditions, potentially 
compounding both illnesses [4,5]. Diagnosis of cancer 
may make management of diabetes more difficult or 
conversely, diabetes may be predictive of poorer cancer 
outcomes [4,6-9]. Understanding the relationship between 
cancer and diabetes and the impact that one disease may 
have on the other may provide important insight regard- 
ing both health and survival and has become a key re- 
search priority. 

Recent studies have shed considerable light on the po- 
tential physiological and clinical relationship between 
diabetes and cancer [10-12]. Diabetes and cancer share 
several important risk factors and attempting to define 
the relationship between the two diseases is additionally 
confounded by demographic and lifestyle characteristics 
as well as exposure to diabetes medications [13]. Cancer 
tends to be somewhat easier to study using information 
available in the electronic medical record (EMR) and vari- 
ous cancer registries. Studies of diabetes generally prove 
to be more difficult. Diabetes develops gradually and is 
characterized by progressive insulin resistance and hyper- 
insulinemia during the pre-diabetes phase followed by 
increasing hyperglycemia after clinical onset. Lifestyle 
modifications, medication use, and other treatment op- 
tions are not usually initiated until after clinical diagnosis 
and many patients with diabetes go unrecognized for long 
periods of time. Reliance on administrative data, which in- 
dicates only when diabetes was recognized and diagnosed, 
not necessarily when it began, has precluded careful tem- 
poral analyses of the relationship between diabetes and 
cancer. Even so, EMRs have served as an important data 
source in initial studies of the relationship between cancer 
and diabetes. Limitations of such studies include impreci- 
sion in capture of diabetes onset date, inaccuracies in elec- 
tronic data, difficulty in distinguishing between type 1 and 
type 2 diabetes, and biases inherent to retrospective and 
observational studies. Due in part to these limitations, the 
temporal and causal relationship between diabetes and 
cancer, if any, remains difficult to explore. 

Numerous individual studies and meta-analyses have 
yielded important information regarding cancer risk fol- 
lowing diabetes onset [10]. However, recent evidence 
suggests that the hyperglycemia characteristic of overt 
diabetes may be less important in promoting cancer risk 
than the hyperinsulinemia characteristic of the pre- 
diabetes phases [13,14]. Due to the powerful effects of 
insulin as a growth factor and the potential for hyperin- 
sulinemia to impact cancer development, a number of 
studies have attempted to correlate insulin levels with 
cancer risk, finding some effect for certain cancer types 
[15-18]. However, little attention has been paid to the 
pre-diabetes phase specifically in patients known to pro- 
gress to diabetes, and the long-term temporal relation- 
ship between the two diseases remains unclear. The 
purpose of this paper is to describe a unique method for 
determining date of onset of type 2 diabetes, even when 
onset of disease occurs prior to clinical recognition. This 
algorithm leverages the EMR to draw upon clinical, ad- 
ministrative, and laboratory data to accurately pinpoint 
the date of diabetes onset, exclude potential subjects 
with type 1 diabetes, and examine additional confounding 
factors, such as glycated hemoglobin (HbA1c) levels 
and medication exposure. Our study algorithm and 
methods are described in detail and compared to those 
published by other authors. Limitations and potential 
biases are also discussed. 

Methods 

Marshfield Clinic is a multi-specialty, regional healthcare 
system in Wisconsin, USA. The Marshfield Clinic EMR 
contains data dating back to the 1960s and provides 
comprehensive information regarding all encounters 
with the Marshfield Clinic and cooperating hospitals, in- 
cluding St. Joseph's Hospital in Marshfield, WI. In 2007, 
Wilke et al. [19] published an electronic algorithm for 
identifying patients with diabetes mellitus in the EMR. 
However, this algorithm was focused on a specific subset 
of Marshfield Clinic patients enrolled in the Personalized 
Medicine Research Project (PMRP) and could not accur- 
ately pinpoint date of clinical diabetes onset. In the 
present study, we took this algorithm a step further and 
developed matched cohorts of patients with and without 
type 2 diabetes who received care at the Marshfield 
Clinic to retrospectively examine the temporal relation- 
ship between diabetes and three different types of can- 
cer, including breast, prostate, and colon cancer, as well 
as medication exposure and glycemic control. The study 
was approved by the Marshfield Clinic Scientific Review 
Committee and the Institutional Review Board and a wai- 
ver of subject consent was granted [study ID — ONI10711/ 
78037]. 

Subject selection 

Patients diagnosed with type 2 diabetes between January 
1, 1995 and December 31, 2009 were eligible for inclusion 
in the study. All potential subjects were required to be 
30 years of age or older by the end of the study period and 
could not have any diabetes-related diagnoses or medica- 
tion use prior to the study period. The pool of potential 
subjects was then divided based on whether or not they 
had any diabetes-related diagnostic codes during the study 
period. Patients with one or more diabetes-related codes 
during the study period comprised the pool of potential 
subjects for the cohort with diabetes. Patients with no 
diabetes-related diagnoses prior to the end of the study 
period comprised the pool of potential subjects for the co- 
hort without diabetes. 

For study purposes, type 2 diabetes was defined using 
a combination of diagnostic codes and laboratory results 
(Figure 1). Data were collected electronically from the 
Marshfield Clinic EMR and cancer registry. Data valid- 
ation was performed in an iterative fashion and included 
both electronic screening (e.g., graphical view of labora- 
tory results, identification of seeming discrepancies be- 
tween laboratory values and diagnoses) and manual 
review. Subjects manually reviewed included two se- 
quential random samples (90 and 84 cases, respectively), 
with the selection stratified electronically with respect to 
prevalence of diabetes and cancer, calendar year, and lo- 
cation of residence. These samples were manually vali- 
dated by search of the patient medical record for the 
presence and incident dates of diagnosis for diabetes and 
cancer through utilization of text records, diagnosis 
description codes, pathology reports, and review of 
diabetes-related laboratory values and medications. Val- 
idation results were used in refining the cohort defini- 
tions. Subjects with diabetes had at least one diagnostic 
code for type 2 diabetes mellitus (International Classifi- 
cation of Disease, version 9 (ICD-9) 250.X0 or 250.X2), 
and this code was required to precede any code for type 
1 diabetes by at least one year. In addition, subjects with 
diabetes were required to have at least two abnormal laboratory 
test results for glycated hemoglobin (HbA1c) 
or glucose [HbA1c ≥ 6.5% (48 mmol/mol), fasting glucose 
≥ 126 mg/dL, or random glucose ≥ 200 mg/dL] with 
the second being no more than three years prior to first 
type 2 diabetes diagnostic code, and at least one normal 
HbA1c or glucose test prior to, but within three 
years of first diabetes diagnostic code [HbA1c < 6.5% 
(48 mmol/mol), fasting glucose < 126 mg/dL, or random 
glucose < 200 mg/dL]. Laboratory criteria for type 
2 diabetes were based on American Diabetes Association 
(ADA) criteria [20]. The date of diabetes onset 
was defined as the earlier of the first type 2 diabetes diagnosis 
by diagnostic code or the second high diabetes-related 
laboratory value. Requiring that both a normal result and 
the beginning of abnormal tests occurred within three 
years helps ensure capture of diabetes onset within that 
three-year window and excludes subjects with long-term, 
undiagnosed diabetes. Additionally, subjects treated with 
diabetes medications > 30 days before diagnosis were excluded. 
For the cohort with diabetes, the date of diabetes 
onset was considered the reference date. Subjects without 
diabetes were also verified by laboratory values and clinical 
data. Potential subjects without diabetes with no normal 
glucose or HbA1c test, as defined by the ADA, and 
those treated with diabetes medications prior to the end 
of the study period were excluded. 

    Electronic diagnostic code for type 2 diabetes mellitus (DM) 
  AND 
    At least 2 high HbA1c or glucose tests within 3 years of DM diagnostic code 
    (HbA1c ≥ 6.5% OR fasting glucose ≥ 126 mg/dL OR random glucose ≥ 200 mg/dL) 
  AND 
    At least 1 normal HbA1c or glucose test prior to, but within 3 years of DM diagnostic code 
    (HbA1c < 6.5% OR fasting glucose < 126 mg/dL OR random glucose < 200 mg/dL) 

    Reference date for DM onset: earliest of DM diagnostic code or second high lab 

Figure 1 Algorithm for defining type 2 diabetes. Type 2 diabetes 
was defined using a combination of diagnostic and laboratory data. 
Laboratory results indicative of diabetes were based on American 
Diabetes Association criteria [20]. 
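
The case definition and onset rule described above can be sketched in a few lines of Python. This is an illustrative reading of the published criteria, not the code used in the study: the `Lab` record, the function names, the approximation of three years as 1,095 days, and the symmetric three-year window around the diagnostic code are all assumptions, and the medication exclusion and the type 1 code check are omitted.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

# Assumption: "three years" is approximated as 3 * 365 days.
THREE_YEARS = timedelta(days=3 * 365)

# ADA cut-offs quoted in the text; a value at or above the cut-off is "high".
CUTOFF = {"hba1c": 6.5, "fasting": 126.0, "random": 200.0}


@dataclass
class Lab:
    """Hypothetical lab record: HbA1c in %, glucose in mg/dL."""
    when: date
    kind: str  # "hba1c", "fasting" or "random"
    value: float


def is_high(lab: Lab) -> bool:
    return lab.value >= CUTOFF[lab.kind]


def onset_date(first_dx: date, labs: List[Lab]) -> Optional[date]:
    """Return the reference date, or None if the laboratory criteria fail."""
    # Keep only results within three years of the first type 2 diagnostic code.
    window = sorted(
        (lab for lab in labs if abs(lab.when - first_dx) <= THREE_YEARS),
        key=lambda lab: lab.when,
    )
    high = [lab for lab in window if is_high(lab)]
    if len(high) < 2:
        return None
    # A normal result must precede the first high result and the diagnostic code.
    has_normal = any(
        not is_high(lab) and lab.when < high[0].when and lab.when <= first_dx
        for lab in window
    )
    if not has_normal:
        return None
    # Onset is the earlier of the first code and the second high result.
    return min(first_dx, high[1].when)
```

For a patient with a normal fasting glucose in January 2000, high results in March and September 2001, and a first diagnostic code in February 2002, the sketch returns the date of the second high result; dropping the normal result makes it return `None`.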

Subject matching 

Potential subjects without diabetes were frequency matched 
at a 5:1 ratio with subjects with diabetes based on date of 
birth (five categories), smoking history (ever/never), resi- 
dence (ever/never) in the Marshfield Epidemiologic Study 
Area (MESA, a geographic region consisting of 14 ZIP 
codes in the primary service area of Marshfield Clinic 
[21]), and diabetes diagnosis/reference period (1995 - 
1999, 2000 - 2004, or 2005 - 2009). Matching variables 
were selected based on the potential for effects on cancer 
risk and healthcare seeking behaviors. Additional baseline 
characteristics were accounted for via statistical adjust- 
ment. Initial subject selection required a minimum of 
60 days observation time in the study period meaning that 
subjects had to have visits within the Marshfield Clinic 
system spanning at least the 60 days after the reference 
date. An actual reference date was assigned for each sub- 
ject without diabetes by randomly sampling from the sub- 
set of subjects with diabetes in the same stratum, thereby 
ensuring that the distribution of possible observation times 
for subjects without diabetes corresponded precisely with 
those of their matched subjects with diabetes (Figure 2). 
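
The reference-date assignment for subjects without diabetes amounts to sampling, with replacement, from the onset dates of the subjects with diabetes in the same stratum. A minimal sketch follows; the function name, the tuple layout, and the stratum keys are hypothetical, and the 5:1 frequency matching and the 60-day observation requirement are not reproduced.

```python
import random
from collections import defaultdict
from datetime import date
from typing import Dict, Hashable, List, Tuple


def assign_reference_dates(
    cases: List[Tuple[Hashable, date]],
    controls: List[Tuple[str, Hashable]],
    seed: int = 0,
) -> Dict[str, date]:
    """cases: (stratum, onset date); controls: (control id, stratum)."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    dates_by_stratum = defaultdict(list)
    for stratum, onset in cases:
        dates_by_stratum[stratum].append(onset)

    assigned: Dict[str, date] = {}
    for control_id, stratum in controls:
        pool = dates_by_stratum.get(stratum)
        if pool:  # controls in a stratum with no cases stay unassigned
            # rng.choice draws with replacement from the stratum's onset dates
            assigned[control_id] = rng.choice(pool)
    return assigned
```

Because every draw comes from the observed onset dates of the matching stratum, the distribution of reference dates among controls follows that of the cases, which is the property the matching step relies on.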

Data collection 

Data sources included Marshfield Clinic's comprehensive 
EMR system and cancer registry [19]. Data were col- 
lected electronically and verified through manual chart 
abstraction of targeted samples. Reference dates for all 
subjects fell within the 15 year study period from 1995 
through 2009, with follow-up through 2011 and observa- 
tion before the reference date as far back as the patient's 
history in the Marshfield Clinic EMR. Based on the need 
for extensive follow-up information, subjects were re- 
quired to have received sufficient care through the 
Marshfield Clinic system so that diagnosis dates for dia- 
betes and/or breast, prostate, or colon cancer could be 
determined with reasonable accuracy. All subjects 
were required to have at least one non-diabetes diag- 
nosis or electronic code documenting a well-visit 
from a Marshfield Clinic provider in at least one of 
the three calendar years prior to the reference date. 
Observation times were censored prior to any large 
gap in the EMR, which was defined as four or more 
consecutive calendar years. 
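
The gap-censoring rule can be illustrated as follows. The sketch assumes that only the calendar years with at least one encounter are available and handles follow-up after the reference date only; the function name and arguments are hypothetical.

```python
from typing import Iterable, Optional


def last_observed_year(
    encounter_years: Iterable[int], reference_year: int, max_gap: int = 4
) -> Optional[int]:
    """Last calendar year of follow-up before a gap of max_gap or more years."""
    years = sorted({y for y in encounter_years if y >= reference_year})
    if not years:
        return None
    last = years[0]
    for year in years[1:]:
        # Number of calendar years with no encounter between two visits.
        empty_years = year - last - 1
        if empty_years >= max_gap:
            break  # censor before the large gap
        last = year
    return last
```

With encounters in 2000, 2001, 2002, 2008, and 2009, follow-up is censored at 2002 because 2003 through 2007 form a five-year gap; a three-year gap does not trigger censoring.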

Figure 2 Subject selection and matching. Subject selection and matching process for defining cohorts of patients with and without type 
2 diabetes. 

  Starting pool: age ≥ 30 by end of study period (2009), no 
  diabetes-related diagnoses before study period begins (1995) 
  (n = 624,293) 

  Diabetes 
    Subjects with ≥ 1 diabetes-related codes during study period 
    (n = 86,433) 
    Application of inclusion criteria: ≥ 2 high HbA1c or glucose tests* 
    within 3 years before or after DM diagnosis AND ≥ 1 normal HbA1c or 
    glucose test** before the abnormal test, but within 3 years prior 
    to DM diagnosis 
    (n = 12,153) 
    Exclusion of patients treated with diabetes medications > 30 days 
    before diagnosis and determination of date of DM onset (reference date) 
    (n = 11,236) 
    Subjects divided into 40 categories by gender, birth date 
    (5 categories), smoking history (ever/never), and MESA residence 
    (ever/never). Each category further divided into three 5-year time 
    periods*** based on diagnosis date, resulting in a total of 
    120 categories 
    (n = 11,236) 

  No Diabetes 
    Subjects with no diabetes-related codes during study period 
    (n = 437,860) 
    Application of inclusion criteria: ≥ 1 normal HbA1c or glucose 
    test** during study period 
    (n = 259,160) 
    Exclusion of patients treated with diabetes medications during 
    study period 
    (n = 255,670) 
    Subjects divided into 40 categories by gender, birth date 
    (5 categories), smoking history (ever/never), and MESA residence 
    (ever/never) 
    (n = 255,670) 

  Matching 
    Subjects with and without diabetes were frequency matched at a 1:5 ratio in the 120 categories 
    created by birth date, gender, smoking status, residence, and time period, with matching by 
    time period for subjects without diabetes based on a minimum of 60 days observation within 
    the time period. Subjects without diabetes were subsequently assigned reference dates by 
    selecting dates randomly with replacement from the corresponding category of subjects with 
    diabetes. This ensured that the final distribution of reference dates for subjects without diabetes 
    reflected the distribution of diagnosis dates in subjects with diabetes. 
    n = 11,236 diabetic subjects (5,423 female); n = 54,365 non-diabetic subjects (26,346 female) 

  *HbA1c ≥ 6.5% OR fasting glucose ≥ 126 mg/dl OR random glucose ≥ 200 mg/dl 
  **HbA1c < 6.5% OR fasting glucose < 126 mg/dl OR random glucose < 200 mg/dl 
  ***1995-1999, 2000-2004, and 2005-2009 

Cancer diagnoses required two documented diagnoses 
by ICD-9 code within the EMR. The first date on which 
the ICD-9 code was used was considered the date of 
cancer diagnosis and data were merged with data from 
the Cancer Registry to validate diagnoses and provide 
additional information. Several covariates with the poten- 
tial to influence cancer risk were also examined, including 
comorbidities and clinical risk factors, as well as use of 
chemotherapy and radiation during cancer treatment. 
Cancer treatment data were only available for subjects in 
the local cancer registry, which limited analyses using 
these data. Comorbidities of interest included myocardial 
infarction, coronary heart disease, peripheral vascular dis- 
ease, cardiovascular disease, chronic pulmonary disease, 
rheumatic heart disease, and renal insufficiency/renal 
failure, which were summarized using a modified Charlson 
score (excluding cancer and diabetes). Comorbidities were 
established by interrogating the EMR for relevant diagnos- 
tic codes, requiring at least two documented diagnoses in 
the subject's medical record. The EMR was also interro- 
gated for body mass index (BMI), smoking history, and in- 
surance status at reference date as well as frequency of 
healthcare visits before and after the reference date. For 
subjects with diabetes, exposure to three classes of dia- 
betes medications including insulin, metformin, and sulfo- 
nylurea drugs was ascertained. 

Results 

The process of participant selection and matching is 
summarized in Figure 2. Of note, application of our algorithm, 
including laboratory parameters, to the pool of 
potential subjects with diabetes resulted in exclusion of 
approximately 70% of patients. Less than 10% of poten- 
tial subjects were lost when those with other diabetes- 
related diagnoses or abnormal glucose values > 3 years 
prior to diabetes diagnosis were excluded. An additional 
40% of remaining potential subjects were excluded be- 
cause they did not have at least two high glucose or 
HbA1c levels, and another 50% of potential subjects 
were excluded because they did not have a normal 
HbA1c or glucose value recorded within 3 years prior to 
diabetes diagnosis. After application of inclusion and ex- 
clusion criteria, there were 11,236 patients included in 
the final cohort with diabetes. After assigning reference 
dates, 54,365 participants without diabetes remained. 
Losses in the matching process resulted in a final matched 
cohort with 4.8 subjects without diabetes for each patient 
with diabetes, rather than the target ratio of 5:1. Despite a 
smaller final sample size we believe a more defined cohort 
is likely to be more informative and better suited for ana- 
lysis than a less well-defined and refined larger cohort. 

In our final validation sample, we manually abstracted 
evidence of diabetes diagnosis in 70 patient charts. If a 
diabetes diagnosis was present (N = 50), we verified the 
date with laboratory values for HbA1c and glucose, office 
notes, and medications listed. Prior records were 
checked to ensure that the diagnosis had not been men- 
tioned previously but not coded. In patients in whom no 
diagnosis of diabetes was evident (N = 20), we verified 
the absence of any diabetes diagnoses on problem lists, 
verified that there were no high HbA1c or glucose levels, 
verified that no diabetes medications were listed, and 
that diabetes was not mentioned in the notes for a re- 
cent office visit or history and physical. In this validation 
sample, the observed predictive value for control subjects 
(NPV) was 100% (20/20). It is important to note 
that controls can always become cases, and this was 
observed in one control subject who developed diabetes 
in 2011, 7 years after the assigned reference date in 
2004, but this has no bearing on algorithm validity. The 
predictive value for case status (PPV) was 96% (48/50), 
with two subjects appearing to be incorrectly identified. 
However, upon arbitration, one of the two subjects was 
found to have a diagnosis of diabetes during the study 
period, increasing the positive predictive value to 98%. 
Overall sensitivity of the algorithm for detecting type 2 
diabetes was 96% (95% CI 86.3-99.4%) and overall spe- 
cificity was 95% (95% CI 75.1-99.2%). The date of dia- 
betes onset determined by manual chart review was 
within 6 months of the study-assigned date of onset in 
over 70% of subjects with diabetes. 
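
The predictive values above are simple proportions of the chart-review counts, and an interval can be attached to any such proportion. The sketch below computes an exact (Clopper-Pearson) binomial interval by bisection using only the standard library. The text does not state which interval method was used for the reported sensitivity and specificity, so the choice of method and the function names here are assumptions for illustration.

```python
from math import comb
from typing import Tuple


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve(k: int, n: int, target: float) -> float:
    """Find p with binom_cdf(k, n, p) == target (the CDF decreases in p)."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > target:
            lo = mid  # CDF still too large, so p must increase
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(k: int, n: int, alpha: float = 0.05) -> Tuple[float, float]:
    """Clopper-Pearson interval for k successes out of n."""
    lower = 0.0 if k == 0 else _solve(k - 1, n, 1 - alpha / 2)
    upper = 1.0 if k == n else _solve(k, n, alpha / 2)
    return lower, upper


ppv = 48 / 50  # confirmed cases among algorithm-identified cases
npv = 20 / 20  # confirmed controls among algorithm-identified controls
ppv_ci = exact_ci(48, 50)
npv_ci = exact_ci(20, 20)
```

With validation samples of 50 and 20 charts the intervals are wide, which is worth keeping in mind when comparing predictive values across algorithms.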



Descriptive statistics of subjects with and without dia- 
betes are shown in Table 1. Mean observation in the EMR 
was similar for both groups with approximately 16 years 
before the reference date and 6-7 years after the reference 
date. The cohorts with and without diabetes were largely 
similar, except that BMI was higher in subjects with dia- 
betes, and visit frequency suggested that patients with dia- 
betes tended to have more frequent contact with the 
healthcare system, even before onset of diabetes. Table 1 
also shows the number of patients in each cohort with a 
diagnosis of breast, prostate, or colon cancer. 

Discussion 

Several studies have examined the influence of diabetes 
on cancer risk and the general consensus suggests that 
diabetes increases cancer risk, with the notable excep- 
tion of prostate cancer [10]. Diabetes is a progressive 
disease and physiological changes begin to occur long 
before clinical onset of disease [22]. During the pre- 
diabetes phase, patients undergo a prolonged period of 
increasing insulin resistance and hyperinsulinemia that 
ultimately results in the progressive hyperglycemia char- 
acteristic of diabetes itself. Recent evidence suggests that 
the hyperinsulinemia characteristic of the pre-diabetes 
phase is more important for promoting cancer risk than 
the hyperglycemia present after clinical onset [13,14]. 
Despite this evidence, examining cancer risk in the pre- 
diabetes phase is difficult and the temporal relationship 
between the two diseases has remained largely unex- 
plored. We developed an electronic algorithm that calls 
upon administrative, laboratory, and clinical data to accur- 
ately identify patients with type 2 diabetes and to deter- 
mine the date of clinical onset for over 10,000 patients. A 
cohort without diabetes was also generated and includes 
over 50,000 patients with assigned reference dates. To- 
gether, the cohort of approximately 65,000 patients with 
over 16 years of observation before diabetes onset and 6-7 
years of follow-up after provides a resource for the 
temporal examination of the relationship between diabetes 
and cancer risk. 

Table 1 Subject descriptive characteristics by type 2 diabetes status 

Variables                                         Diabetes           No diabetes 
                                                  (N = 11,236)       (N = 54,365) 
                                                  N (%)              N (%) 
Gender 
  Male                                            5813 (51.7)        28019 (51.5) 
  Female                                          5423 (48.3)        26346 (48.5) 
Mean age (years) (IQR)                            62.9 (53-73)       63.2 (53-72) 
Age group 
  30-49 years                                     1940 (17.3)        9760 (18.0) 
  50-59 years                                     2681 (23.9)        12832 (23.6) 
  60-69 years                                     3110 (27.7)        14294 (26.3) 
  70-79 years                                     2409 (21.4)        10711 (19.7) 
  ≥80 years                                       1096 (9.8)         6768 (12.4) 
Smoking status 
  Ever                                            7579 (67.5)        36427 (67.0) 
  Never                                           3657 (32.5)        17938 (33.0) 
Diabetes diagnosis period 
  1995-1999                                       2486 (22.1)        12123 (22.3) 
  2000-2004                                       4657 (41.4)        22581 (41.5) 
  2005-2009                                       4093 (36.4)        19661 (36.2) 
MESA residency 
  No                                              9036 (80.4)        43871 (80.7) 
  Yes                                             2200 (19.6)        10494 (19.3) 
Mean BMI (kg/m²) (IQR)                            33.4 (28-37)       28.9 (26-39) 
Have insurance                                    8881 (79.0)        40852 (75.1) 
Visit frequency during 2 years before diabetes 
  0-5                                             2285 (20.3)        15848 (29.2) 
  6-10                                            2367 (21.1)        13093 (24.1) 
  11-20                                           3146 (28.0)        14116 (26.0) 
  >20                                             3438 (30.6)        11308 (20.8) 
Visit frequency during 2 years after diabetes 
  0-5                                             1175 (10.5)        21396 (39.4) 
  6-10                                            1523 (13.6)        10286 (18.9) 
  11-20                                           3223 (28.7)        11492 (21.1) 
  >20                                             5315 (47.3)        11191 (20.6) 
Mean observation time (IQR) 
  Years before diabetes onset                     16.6 (6.1-26.1)    16.3 (5.7-26.2) 
  Years after diabetes onset                      7.4 (4.4-10.0)     6.1 (2.8-8.8) 
Comorbidities 
  Myocardial infarction                           208 (1.9)          609 (1.1) 
  Coronary heart disease                          590 (5.3)          1380 (2.5) 
  Peripheral vascular disease                     272 (2.4)          984 (1.8) 
  Cardiovascular disease                          351 (3.1)          1290 (2.4) 
  Chronic pulmonary disease                       1069 (9.5)         3106 (5.7) 
  Rheumatic heart disease                         200 (1.8)          999 (1.8) 
  Renal disease                                   207 (1.8)          713 (1.3) 
Cancer 
  Breast (a)                                      543 (10.0)         2282 (8.7) 
  Colon 
    Men (b)                                       173 (3.0)          642 (2.3) 
    Women (a)                                     129 (2.4)          548 (2.1) 
  Prostate (b)                                    600 (10.3)         2832 (10.1) 

IQR, interquartile range; MESA, Marshfield Epidemiologic Study Area; BMI, 
body mass index. 
(a) Women only, N = 5,423 with diabetes and 26,346 without diabetes. 
(b) Men only, N = 5,813 with diabetes and 28,019 without diabetes. 

Previous studies regarding the relationship between 
diabetes and cancer relied on algorithms to identify diabetes 
that were heavily reliant on self-report and/or administrative 
data, limited in their ability to distinguish 
between type 1 and 2 diabetes, and unable to accurately 
determine date of diabetes onset [23,24]. Several algorithms 
have recently been developed to make use of 
EMR data to accurately identify patients with type 2 diabetes 
primarily for surveillance purposes. Kuydakov 
et al. [25] developed an algorithm for identifying newly 
diagnosed cases of type 2 diabetes with the main criterion 
being a minimum 30-day window between the first 
documented visit in the EMR and entry of type 2 diabetes 
in the problem list. However, this method fails to 
account for the potential for subjects to have diabetes 
for long periods of time before a diagnosis is made. 
Other algorithms have used additional EMR data to cap- 
ture patients with diabetes even before diagnosis using 
various combinations of laboratory results and prescrip- 
tion information in addition to diagnostic and billing 
codes [19,26-29]. While these algorithms perform well 
for their intended purposes, none serve to accurately 
identify the date of diabetes onset and are thus unsuit- 
able for examination of the temporal relationship between 
diabetes and cancer, which requires clear delineation of 
the time periods before and after diabetes onset. The 
method described here focuses on determining date of 
clinical diabetes onset as accurately as possible using a 
combination of laboratory results and diagnostic codes in 
addition to time limits to ensure capture of onset within a 
three-year time window. Table 2 offers a side-by-side 
comparison of these algorithms. The element of time is of 
particular importance to the algorithm designed for the 
present study. For a patient to be classified as having diabetes, 
abnormal laboratory values (HbA1c or glucose) 
were required to occur within three years of a normal laboratory 
value, suggesting that clinical onset of diabetes 
occurred in the interim. Date of onset was defined as the 
earlier of the first diabetes diagnosis code or second high 
laboratory result, clearly delineating the time periods be- 
fore and after diabetes onset for temporal examination. 
Validation of a similar Marshfield Clinic EMR-based algo- 
rithm for identification of patients with and without type 
2 diabetes showed a 99% predictive value for type 2 cases 
and 98% for type 2 controls [30]. Results were similar 
using the algorithm described here with a 98% predictive 
value for type 2 cases and 100% for type 2 controls. 

Table 2 Comparison of algorithms using electronic medical record data for identification of patients with type 2 
diabetes 

Columns: Reference; Billing/Diagnostic codes; EMR elements (Laboratory results (a), Medications); Timeframe; Diabetes onset. 
Entries are listed column by column, top to bottom. 

Billing/Diagnostic codes, by reference 
  [23]  ≥ 1 short-stay hospital, skilled nursing facility, or home health agency claim or ≥ 2 
        physician/supplier claims with diabetes diagnosis 
  [24]  1 hospital discharge abstract or 2 physician services claims showing diabetes 
  [19]  250.X0, 250.X2, 357.2, 362.0X, 583.81 
  [25]  250.X0, 250.X2, or 362.XX (no insulin) 
  [25]  On problem list (coded or free text) ≥ 2 times in 2 years 
  [27]  250.x 
  [28]  ≥ 2 250.X0 or 250.X2 
  [29]  No type 1 code, ≥ 2 type 2 codes 
  Current study  ≥ 1 250.X0 or 250.X2, ≥ 1 year before any type 1 code 

Laboratory results (a) 
  ≥ 1 high HbA1c or random glucose or ≥ 2 random glucose tests 
  ≥ 1 high HbA1c or ≥ 2 high fasting or random glucose tests 
  ≥ 2 high fasting glucose tests in 1 year or any high HbA1c (b) 
  High HbA1c, fasting, or random glucose (c) 
  High HbA1c or fasting glucose test 
  Abnormal glucose or HbA1c 
  Current study: ≥ 2 high HbA1c or glucose test and ≥ 1 normal HbA1c or glucose test 

Medications 
  Metformin, sulfonylurea, or insulin 
  Hypoglycemic medications 
  Number unique anti-diabetes medications 
  Insulin or oral hypoglycemic agents except metformin 
  Type 2 diabetes medications 
  Current study: excluded for diabetes medication > 30 days before diagnosis 

Timeframe 
  1-2 year identification period 
  2 year period 
  Diagnosis ≥ 30 days after first office visit 
  Current study: normal and abnormal labs within 3 years 

Diabetes onset 
  First appearance in problem list 
  Surveillance in real time 
  Current study: earliest of first diagnosis or second high lab 

HbA1c, glycated hemoglobin. 
(a) Laboratory values coincide with American Diabetes Association guidelines (HbA1c ≥ 6.5%, fasting blood glucose ≥ 126 mg/dl, random blood glucose ≥ 200 mg/dl) [20] unless otherwise specified. 
(b) Fasting blood glucose > 7 mmol/L, HbA1c > 7%. 
(c) HbA1c > 6%, fasting or random glucose > 110 mg/dl. 

Despite meticulous selection of patients with and without 
diabetes for study inclusion, our cohort is nevertheless 
subject to biases inherent to retrospective and observational 
studies as well as certain time-related biases 
common in observational studies. In addition, the data 
available in an EMR are only as good as the data input 
during routine patient care. As such, for healthcare 
systems in which several healthcare choices are avail- 
able nearby, laboratory values and diagnoses captured 
outside of the healthcare system may not be available. 
Marshfield Clinic serves a relatively rural, agriculture- 
based population with little turnover and little choice 
of healthcare provider. As such, our EMR serves as a 
robust source of data, but we recognize that certain 
data points may be missing. Ascertainment bias is in- 
herent to retrospective studies. In the current cohort, 
ascertainment bias may result from the fact that pa- 
tients with diabetes have more frequent contact with 
the healthcare system. Additionally, HbA1c screening 
was not recommended by the ADA until 2010, after 
the study period, and in the 3 years prior it is estimated 
that only 10-20% of adults without diabetes underwent 
HbA1c testing [31], which may introduce an additional 
source of ascertainment bias. As with other laboratory 
tests, methods for measuring HbA1c have also changed 
over time. Use of reference period as a matching 
criterion in cohort development is likely to minimize 
the effects of any such change, however. Similarly, se- 
lection bias may result from the unintentional selec- 
tion for differing characteristics among patients who, 
for example, receive a particular diabetes treatment. 
Additionally, labs drawn to assess HbA1c and glucose 
levels may be more likely to be performed in patients 
with a higher risk for cancer, of concern in both 
groups. The effects of selection bias are minimized to 
some extent by the cohort design, which uses an exten- 
sive matching process to account for age, gender, resi- 
dence, smoking history, and reference period. In future 
work using the cohort described here, the potential for 
additional confounding as a result of selection bias will 
be minimized via proportional hazards regression model- 
ing with adjustment for relevant covariates. Time-related 
biases, including immortal time bias, time-window bias, 
and time-lag bias, will be of particular concern when con- 
sidering the effect of exposure to diabetes medications on 
cancer risk [32]. While elimination of such biases may not 
be realistic, efforts to minimize their effects include use of 
time-varying analyses, assessment of follow-up time, and 
examination of both exposure and duration in analyses of 
diabetes medications. Importantly, reference period was 
used as a matching criterion in cohort development and 
follow-up time before and after the reference date or date 
of diabetes onset was similar. 

Conclusion 

Murdoch and Detsky recently reported on the inevitable 
application of the massive amount of data captured by 
the EMR to health care, emphasizing the potential value 
of using information generated in the course of routine 
care to answer important questions and to improve the 
quality of care [33]. Here we demonstrate an example 
using data abstracted electronically from the EMR to de- 
velop patient cohorts for the careful examination of the 
temporal relationship between diabetes and cancer. To 
date, the cohort described here has been used to examine 
the temporal relationship between diabetes and breast 
cancer in women [34], prostate cancer in men [35], and 
colon cancer [36], as well as the effects of glycemic control 
and medication exposure on cancer risk [37]. In the fu- 
ture, we plan to use this cohort to examine tumor severity 
and survival as well as the effects of additional disease 
conditions and comorbidities, such as sleep apnea, on can- 
cer risk in patients with diabetes. 